When can Multi-Site Datasets be Pooled for Regression? Hypothesis Tests, ℓ2-consistency and Neuroscience Applications
Abstract
Many studies in biomedical and health sciences involve small sample sizes due to logistical or financial constraints. Often, identifying weak (but scientifically interesting) associations between a set of predictors and a response necessitates pooling datasets from multiple diverse labs or groups. While there is a rich literature in statistical machine learning on distributional shifts and inference in multi-site datasets, it is less clear when such pooling is guaranteed to help (and when it is not), independent of the inference algorithms we use. In this paper, we present a hypothesis test to answer this question, both for classical and high-dimensional linear regression. We precisely identify regimes where pooling datasets across multiple sites is sensible, and show how such policy decisions can be made via simple checks executable on each site before any data transfer ever happens. With a focus on Alzheimer’s disease studies, we show empirical results in regimes suggested by our analysis, where pooling a locally acquired dataset with data from an international study improves power.
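The trade-off the abstract describes can be illustrated with a toy simulation. The sketch below is not the paper's hypothesis test; it is a minimal example under assumed settings (Gaussian features, unit noise variance, a hypothetical small local site of 30 samples and a larger external site of 300, and an external coefficient vector offset by a constant `shift`). It compares the squared estimation error of ordinary least squares fit on the local site alone against a fit on the pooled data.

```python
import numpy as np

def ols(X, y):
    # Ordinary least-squares estimate of the coefficient vector.
    return np.linalg.lstsq(X, y, rcond=None)[0]

def mean_sq_error(shift, n_local=30, n_ext=300, p=5, trials=200, seed=0):
    """Average squared error of local-only vs. pooled OLS estimates.

    All sizes and the form of the shift are illustrative assumptions,
    not values taken from the paper.
    """
    rng = np.random.default_rng(seed)
    beta = np.ones(p)          # coefficients at the local site
    beta_ext = beta + shift    # coefficients at the external site
    err_local = 0.0
    err_pooled = 0.0
    for _ in range(trials):
        X1 = rng.normal(size=(n_local, p))
        X2 = rng.normal(size=(n_ext, p))
        y1 = X1 @ beta + rng.normal(size=n_local)
        y2 = X2 @ beta_ext + rng.normal(size=n_ext)
        b_local = ols(X1, y1)
        b_pooled = ols(np.vstack([X1, X2]), np.concatenate([y1, y2]))
        err_local += np.sum((b_local - beta) ** 2)
        err_pooled += np.sum((b_pooled - beta) ** 2)
    return err_local / trials, err_pooled / trials

if __name__ == "__main__":
    for shift in (0.0, 0.1, 1.0):
        local, pooled = mean_sq_error(shift)
        print(f"shift={shift}: local={local:.3f}, pooled={pooled:.3f}")
```

With no shift, pooling reduces variance and the pooled estimate is closer to the local coefficients; with a large shift, the bias introduced by the external site outweighs the variance reduction. A test for poolability has to decide which side of that boundary a given pair of sites falls on.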
Similar resources
When can Multi-Site Datasets be Pooled for Regression? Hypothesis Tests, ℓ2-consistency and Neuroscience Applications (Supplementary Material)
Remarks on transformations in the pre-processing step: For all $i \in \{1, \ldots, k\}$, after applying the transformation (shift correction), we pool $(X_i, y_i)$ together to estimate $\beta^*$. Note that in general the transformation (shift correction) should not depend on the responses $y_i$; otherwise we get a dependence on the noise. To see this, notice that $y_i = X_i \beta_i + \epsilon_i$, where $X_i$ is the transformed set of features....
Simultaneous robust estimation of multi-response surfaces in the presence of outliers
A robust approach should be considered when estimating regression coefficients in multi-response problems. Many models are derived from the least squares method. Because the presence of outlier data is unavoidable in most real cases and because the least squares method is sensitive to these types of points, robust regression approaches appear to be a more reliable and suitable method for addres...
Nonparametric Learning in High Dimensions
This thesis develops flexible and principled nonparametric learning algorithms to explore, understand, and predict high dimensional and complex datasets. Such data appear frequently in modern scientific domains and lead to numerous important applications. For example, exploring high dimensional functional magnetic resonance imaging data helps us to better understand brain functionalities; infer...
Integrating Local and Global Error Statistics for Multi-Scale RBF Network Training: An Assessment on Remote Sensing Data
BACKGROUND This study discusses the theoretical underpinnings of a novel multi-scale radial basis function (MSRBF) neural network along with its application to classification and regression tasks in remote sensing. The novelty of the proposed MSRBF network relies on the integration of both local and global error statistics in the node selection process. METHODOLOGY AND PRINCIPAL FINDINGS The ...
Regression spline bivariate probit models: A practical approach to testing for exogeneity
Bivariate probit models can deal with a problem usually known as endogeneity. This issue is likely to arise in observational studies when confounders are unobserved. We are concerned with testing the hypothesis of exogeneity (or absence of endogeneity) when using regression spline recursive and sample selection bivariate probit models. Likelihood ratio and gradient tests are discussed in this c...